include_graphics("brightness.png")
An arrangement of artworks according to their average pixel brightness, as given in the reading.
include_graphics("dimred-animation.gif")
The dimensionality reduction algorithm in this animation converts a large number of raw features into a position on a one-dimensional axis defined by average pixel brightness. In general, we might reduce to dimensions other than 1D, and we will often want to define features tailored to the dataset at hand.
Linear dimensionality reduction using PCA.
What is this object?
Not so complicated now
cocktails_df <- read_csv("https://uwmadison.box.com/shared/static/qyqof2512qsek8fpnkqqiw3p1jb77acf.csv")
cocktails_df[, 1:6]
## # A tibble: 937 x 6
## name category light_rum lemon_juice lime_juice sweet_vermouth
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Gauguin Cocktail Clas~ 2 1 1 0
## 2 Fort Lauderda~ Cocktail Clas~ 1.5 0 0.25 0.5
## 3 Cuban Cocktai~ Cocktail Clas~ 2 0 0.5 0
## 4 Cool Carlos Cocktail Clas~ 0 0 0 0
## 5 John Collins Whiskies 0 1 0 0
## 6 Cherry Rum Cocktail Clas~ 1.25 0 0 0
## 7 Casa Blanca Cocktail Clas~ 2 0 1.5 0
## 8 Caribbean Cha~ Cocktail Clas~ 0.5 0 0 0
## 9 Amber Amour Cordials and ~ 0 0.25 0 0
## 10 The Joe Lewis Whiskies 0 0.5 0 0
## # ... with 927 more rows
pca_rec object below defines a tidymodels recipe for performing PCA. Computation of the lower-dimensional representation is deferred until prep() is called. This delineation between workflow definition and execution helps clarify the overall workflow, and it is typical of the tidymodels package.pca_rec <- recipe(~., data = cocktails_df) %>%
update_role(name, category, new_role = "id") %>%
step_normalize(all_predictors()) %>%
step_pca(all_predictors())
pca_prep <- prep(pca_rec)
The step_normalize call is used to center and scale all the columns. This is needed because otherwise columns with larger variance will have more weight in the final dimensionality reduction, but this is not conceptually meaningful. For example, if one of the columns in a dataset were measuring length in kilometers, then we could artificially increase its influence in a PCA by expressing the same value in meters. To achieve invariance to this change in units, it would be important to normalize first.
We can tidy each element of the workflow object. Since PCA was the second step in the workflow, the PCA components can be obtained by calling tidy with the argument “2.” The scores of each sample with respect to these components can be extracted using juice. The amount of variance explained by each dimension is also given by tidy, but with the argument type = "variance". We’ll see how to visualize and interpret these results in the next lecture.
tidy(pca_prep, 2)
## # A tibble: 1,600 x 4
## terms value component id
## <chr> <dbl> <chr> <chr>
## 1 light_rum 0.163 PC1 pca_9LiUc
## 2 lemon_juice -0.0140 PC1 pca_9LiUc
## 3 lime_juice 0.224 PC1 pca_9LiUc
## 4 sweet_vermouth -0.0661 PC1 pca_9LiUc
## 5 orange_juice 0.0308 PC1 pca_9LiUc
## 6 powdered_sugar -0.476 PC1 pca_9LiUc
## 7 dark_rum 0.124 PC1 pca_9LiUc
## 8 cranberry_juice 0.0954 PC1 pca_9LiUc
## 9 pineapple_juice 0.119 PC1 pca_9LiUc
## 10 bourbon_whiskey 0.0963 PC1 pca_9LiUc
## # ... with 1,590 more rows
juice(pca_prep)
## # A tibble: 937 x 7
## name category PC1 PC2 PC3 PC4 PC5
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Gauguin Cocktail Classics 1.38 -1.15 1.34 -1.12 1.52
## 2 Fort Lauderdale Cocktail Classics 0.684 0.548 0.0308 -0.370 1.41
## 3 Cuban Cocktail No.~ Cocktail Classics 0.285 -0.967 0.454 -0.931 2.02
## 4 Cool Carlos Cocktail Classics 2.19 -0.935 -1.21 2.47 1.80
## 5 John Collins Whiskies 1.28 -1.07 0.403 -1.09 -2.21
## 6 Cherry Rum Cocktail Classics -0.757 -0.460 0.909 0.0154 -0.748
## 7 Casa Blanca Cocktail Classics 1.53 -0.392 3.29 -3.39 3.87
## 8 Caribbean Champagne Cocktail Classics 0.324 0.137 -0.134 -0.147 0.303
## 9 Amber Amour Cordials and Lique~ 1.31 -0.234 -1.55 0.839 -1.19
## 10 The Joe Lewis Whiskies 0.138 -0.0401 -0.0365 -0.100 -0.531
## # ... with 927 more rows
tidy(pca_prep, 2, type = "variance")
## # A tibble: 160 x 4
## terms value component id
## <chr> <dbl> <int> <chr>
## 1 variance 2.00 1 pca_9LiUc
## 2 variance 1.71 2 pca_9LiUc
## 3 variance 1.50 3 pca_9LiUc
## 4 variance 1.48 4 pca_9LiUc
## 5 variance 1.37 5 pca_9LiUc
## 6 variance 1.32 6 pca_9LiUc
## 7 variance 1.30 7 pca_9LiUc
## 8 variance 1.20 8 pca_9LiUc
## 9 variance 1.19 9 pca_9LiUc
## 10 variance 1.18 10 pca_9LiUc
## # ... with 150 more rows
library("ggplot2")
library("ggrepel")
library("readr")
library("stringr")
library("tidymodels")
library("tidytext")
theme479 <- theme_minimal() +
theme(
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "#f7f7f7"),
panel.border = element_rect(fill = NA, color = "#0c0c0c", size = 0.6),
legend.position = "bottom"
)
theme_set(theme479)
# produced by code in previous notes
components <- read_csv("https://uwmadison.box.com/shared/static/dituepd0751qqsims22v2liukuk0v4bf.csv") %>%
filter(component %in% str_c("PC", 1:5))
scores <- read_csv("https://uwmadison.box.com/shared/static/qisbw1an4lo8naifoxyu4dqv4bfsotcu.csv")
variances <- read_csv("https://uwmadison.box.com/shared/static/ye125xf8800zc5eh3rfeyzszagqkaswf.csv") %>%
filter(terms == "percent variance")
ggplot(variances) +
geom_col(aes(component, value))
Proportion of variance explained by each component in the PCA.
ggplot(components, aes(value, terms)) +
geom_col(show.legend = FALSE) +
facet_wrap(~component, nrow = 1) +
labs(y = NULL) +
theme(axis.text = element_text(size = 7))
The top 5 principal components associated with the cocktails dataset.
reorder() or releveling the associated factor, will reorder all the facets in the same wayreorder_within function coupled with scale_*_reordered, both from the tidytext packagecomponents_ <- components %>%
filter(component %in% str_c("PC", 1:3)) %>%
mutate(terms = reorder_within(terms, abs(value), component))
ggplot(components_, aes(value, terms)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ component, scales = "free_y") +
scale_y_reordered() +
labs(y = NULL) +
theme(axis.text = element_text(size = 7))
The top 3 principal components, with defining variables sorted by the magnitude of their coefficient.
ggplot(scores, aes(PC1, PC2, label = name)) +
geom_point(aes(color = category), alpha = 0.7, size = 1.5) +
geom_text_repel(check_overlap = TRUE, size = 3) +
coord_fixed(sqrt(variances$value[2] / variances$value[1])) # rescale axes to reflect variance
The scores associated with the cocktails dataset.
pca_rec <- recipe(~., data = cocktails_df) %>%
update_role(name, category, new_role = "id") %>%
step_normalize(all_predictors()) %>%
step_pca(all_predictors())
pca_prep <- prep(pca_rec)
# split name and category out of the data frame
pca_result <- cocktails_df %>%
select(-name, -category) %>%
scale() %>%
princomp()
# join them back into the PCA result
metadata <- cocktails_df %>%
select(name, category)
scores_direct <- cbind(metadata, pca_result$scores)
ggplot(scores_direct, aes(Comp.1, Comp.2, label = name)) +
geom_point(aes(color = category), alpha = 0.7, size = 1.5) +
geom_text_repel(check_overlap = TRUE, size = 3) +
coord_fixed(sqrt(variances$value[2] / variances$value[1])) # rescale axes to reflect variance
A plot of the PCA scores made without using tidymodels.
name and category variables to id roles, so that all_predictors() knows to skip them.library("ggplot2")
library("readr")
library("tidymodels")
library("embed")
theme479 <- theme_minimal() +
theme(
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "#f7f7f7"),
panel.border = element_rect(fill = NA, color = "#0c0c0c", size = 0.6),
legend.position = "bottom"
)
theme_set(theme479)
moons <- read_csv("https://uwmadison.box.com/shared/static/kdt9qqvonhcz2ssb599p1nqganrg1w6k.csv")
ggplot(moons, aes(X, Y, col = Class)) +
geom_point() +
scale_color_brewer(palette = "Set2")
An example nonlinear dataset where projections onto any straight line will necessarily cause the classes to bleed together.
UMAP (and many other nonlinear methods) begins by constructing a graph in the high-dimensional space, whose layout in the lower dimensional space will ideally preserve the essential relationships between samples.
When using fewer nearest neighbors, the final dimensionality reduction will place more emphasis on effectively preserving the relationships between points in local neighborhoods.
When using larger neighborhoods, UMAP will place more emphasis on preserving global structure, sometimes at the cost of local relationships between points.
In R, we can implement this using almost the same code as we used for PCA. The step_umap command is available through the embed package.
cocktails_df <- read_csv("https://uwmadison.box.com/shared/static/qyqof2512qsek8fpnkqqiw3p1jb77acf.csv")
umap_rec <- recipe(~., data = cocktails_df) %>%
update_role(name, category, new_role = "id") %>%
step_normalize(all_predictors()) %>%
step_umap(all_predictors(), neighbors = 20, min_dist = 0.1)
umap_prep <- prep(umap_rec)
ggplot(juice(umap_prep), aes(umap_1, umap_2)) +
geom_point(aes(color = category), alpha = 0.7, size = 0.8) +
geom_text(aes(label = name), check_overlap = TRUE, size = 3, hjust = "inward")
The learned UMAP representation of the cocktails dataset.
library("embed")
library("ggplot2")
library("ggrepel")
library("readr")
library("stringr")
library("tidymodels")
library("tidytext")
theme479 <- theme_minimal() +
theme(
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "#f7f7f7"),
panel.border = element_rect(fill = NA, color = "#0c0c0c", size = 0.6),
legend.position = "bottom"
)
theme_set(theme479)
set.seed(479)
These notes don’t introduce any new conceptual material. Instead, they give a few examples of how PCA and UMAP can be used.
moons <- read_csv("https://uwmadison.box.com/shared/static/kdt9qqvonhcz2ssb599p1nqganrg1w6k.csv")
ggplot(moons, aes(X, Y, col = Class)) +
geom_point() +
scale_color_brewer(palette = "Set2")
The original two moons dataset. We will ask both PCA and UMAP to recover a 1D reduction of these 2D data.
moons_ <- recipe(~ ., data = moons) %>%
update_role(Class, new_role = "id")
pca_rec <- step_pca(moons_, all_predictors(), num_comp = 1)
umap_rec <- step_umap(moons_, all_predictors(), num_comp = 1)
scores <- bind_cols(
prep(umap_rec) %>% juice() %>% mutate(umap_1 = scale(umap_1)),
prep(pca_rec) %>% juice() %>% select(-Class)
) %>%
pivot_longer(-Class, names_to = "method")
ggplot(scores, aes(value, method, col = Class)) +
geom_point(position = position_jitter(h = 0.1), alpha = 0.8) +
scale_color_brewer(palette = "Set2")
1D PCA and UMAP representations of the 2D two moons dataset.
taxa variableantibiotic <- read_csv("https://uwmadison.box.com/shared/static/t1lifegdz8s0a8lgckber32ytyh9hu4r.csv")
taxa <- read_csv("https://uwmadison.box.com/shared/static/ng6y6etk79lrm0gtsgw2u0yq6gqcozze.csv")
step_logantibiotic_ <- recipe(~ ., data = antibiotic) %>%
update_role(sample:antibiotic, new_role = "id") %>%
step_log(all_predictors(), offset = 1) %>%
step_normalize(all_predictors())
pca_rec <- step_pca(antibiotic_, all_predictors())
pca_prep <- prep(pca_rec)
scores <- juice(pca_prep)
variances <- tidy(pca_prep, 2, type = "variance")
ggplot(scores, aes(PC1, PC2, col = antibiotic)) +
geom_point(aes(shape = ind), size = 1.5) +
geom_text_repel(aes(label = sample), check_overlap = TRUE, size = 3) +
coord_fixed(sqrt(variances$value[2] / variances$value[1])) +
scale_color_brewer(palette = "Set2")
PCA scores for the antibiotics dataset. The main difference is between study participants, with some secondary variation related to whether the participant was taking the antibiotic at that timepoint.
components_ <- tidy(pca_prep, 3) %>%
filter(component %in% str_c("PC", 1:6)) %>%
mutate(terms_ = reorder_within(terms, abs(value), component)) %>%
group_by(component) %>%
top_n(20, abs(value)) %>%
left_join(taxa)
ggplot(components_, aes(value, terms_, fill = Phylum)) +
geom_col() +
facet_wrap(~ component, scales = "free_y") +
scale_y_reordered() +
labs(y = NULL) +
scale_fill_brewer(palette = "Set1") +
theme(axis.text = element_text(size = 5))
The first six principal components associated with the antibiotics dataset.
components with which to interpret the different axesumap_rec <- step_umap(antibiotic_, all_predictors(), min_dist = 1.5)
umap_prep <- prep(umap_rec)
scores <- juice(umap_prep)
ggplot(scores, aes(umap_1, umap_2, col = antibiotic)) +
geom_point(aes(shape = ind), size = 1.5) +
geom_text_repel(aes(label = sample), max.overlaps = 10) +
scale_color_brewer(palette = "Set2")
The UMAP representation associated with the antibiotics dataset.
fashion <- read_csv("https://uwmadison.box.com/shared/static/aur84ttkwa2rqvzo99qo7yhxemoc6om0.csv") %>%
sample_frac(0.2) %>%
mutate(
image = row_number(),
label = as.factor(label)
)
fashion_ <- recipe(~ ., data = fashion) %>%
update_role(label, image, new_role = "id")
pca_rec <- step_pca(fashion_, all_predictors())
pca_prep <- prep(pca_rec)
scores <- juice(pca_prep) %>%
rename(x = PC1, y = PC2)
ggplot(scores, aes(x, y, col = label)) +
geom_point() +
scale_color_brewer(palette = "Set3") +
coord_fixed()
Principal component scores from the fashion dataset.
pivot_scores(scores, fashion) %>%
overlay_images()
A subset of principal component scores expressed as the corresponding images.
umap_rec <- step_umap(fashion_, all_predictors(), num_comp = 2, min_dist = 0.5)
umap_prep <- prep(umap_rec)
scores <- juice(umap_prep) %>%
rename(x = umap_1, y = umap_2)
ggplot(scores, aes(x, y, col = label)) +
geom_point() +
scale_color_brewer(palette = "Set3") +
coord_fixed()
The locations of all the images according to UMAP. Each color is a different class.
pivot_scores(scores, fashion, scale_factor = 0.05) %>%
overlay_images(scale_factor = 0.05)
A sample of the images at the locations determined by UMAP.